Final project for the Introduction to Data Science / Text as Data class
By Adelaida Barrera (adelaidabarrera@gmail.com), Natalia Mejía (natimp555@gmail.com), Mariana Saldarriaga (m.saldarriaga15@gmail.com) and Isabel de Brigard (isabeldebrigard@gmail.com)
This semester we seem to be uninterruptedly glued to our screens. From Zoom to R, from Moodle to social media, more and more of our days are spent in front of our phones or computers. But this trend, though exacerbated by the unusual conditions of the last year (ugh, yes, it has almost been a year), did not start with some flying rodent far away. Digital platforms have carved out more and more of our time, and seem to direct more and more of our actions. As policy students, we were interested in one instance where this seems to be happening very notably: the way Twitter communicates, condenses, and shapes public discourse around salient policy issues. Guess it’s not procrastination if you can call it research…
But the question “How does Twitter shape public discourse?” seemed a bit ambitious for finals and the four hours of daylight Berlin offers at the end of the year. So we decided to narrow our focus and explore the public discourse around feminism and gender issues revealed by a select (yes, select as in selected by us, more on this in a minute) group of Twitter accounts of activists, political leaders, writers, and all-around opinion shapers from Colombia. We wanted to flex our newly acquired web scraping and text mining muscles, as well as try our hand at some initial network analysis. It was a bumpy-but-fun road that mostly got us excited to keep at it until we are fully fluent in data science (looking at you, regular expressions).
In what follows you will find, first, a brief section on our sample: how we chose the accounts, what a savage beast our data was at the beginning, how we tried to tame it, and what it looked like when we finally decided to take it for a spin. Then we will get down to business and, with the help of some bi-grams and topic modeling, ask what these accounts actually talk about. We will then attempt to understand how they relate to or differ from one another, with some scaling and some network analysis. Finally, with the help of sentiment analysis, we will explore how these tweeters feel about a couple of interesting and somewhat controversial topics.
Because we are in academia after all, a few caveats before we begin. Despite all this, we do think we can draw a couple of interesting conclusions from this first glimpse into our local twitterverse. These are the highlights:
(Here I would put a couple of bullet points with some of the conclusions that remain at the end, if any remain.)
Our initial intuition was that certain Twitter accounts shape public discourse and that gathering those would give us a balanced and relatively complete picture of what most Twitter talk was about. This is partly the idea behind the Cifras y Conceptos opinion leaders panel, which traces the opinions of various individuals on a wide range of topics. These opinion leaders, they say, “differ from public opinion in general, because they are the ones who guide the climate of opinion, have the capacity for foresight and influence political issues and issues on the national agenda” (source: https://cifrasyconceptos.com/productos-panel-de-opinion/), and so tracing their points of view should tell us about more than their personal stance on a given topic.
So we dived into the twitterverse to see who came out to greet us. With a combination of research, personal experience and some calls to people in the Colombian political sphere we came up with 69 individuals and 39 institutional accounts that we felt had to be included if we were interested in what was being said about feminism and gender issues. This gave us an initial tweet count of upwards of XXX, which seemed like a decent amount of text to begin with.
But yes. We know. This is not a complete, balanced, objective picture of the public discourse on Twitter on these issues. Moreover, there is no way to know from our data how biased or incomplete our sample is. We know. Remember the part about the research grant? Well, the phone still hasn’t rung. But we decided to keep going with what we had. This was our thinking: agonizing about how bad our selection was only reinforced what our data science professors have been telling us since Stats I: fancy analytical tools only get you so far. If you actually want to be able to say something about the world, you need to work on your theory. Really work on it. But we felt this was an exercise about the tools we had learned. The tools, not the theory. And for that (trying our hand on a limited sample) we had enough. The rest was standard web scraping. Yes, that phrase actually makes sense to us now. We set up our API authorization and scraped away. And what a beautiful mess we got.
In our initial exploration of the data, we looked at the average number of tweets per individual account and the oldest tweet retrieved for each account. We then plotted the frequency of tweets across time (left plot). This exploration showed that, because some of the accounts posted much less frequently than others, the last 3,200 tweets of each account covered very different time periods. We thus limited our data to tweets from the last 6 months and plotted those (right plot). This produced a much more balanced sample, with 67 individual accounts and 116,402 tweets.
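The time-window step described above can be sketched in a few lines. This is only an illustration in plain Python, not the project's R code: the account names and dates are made up, and treating "6 months" as 183 days is our assumption.

```python
from datetime import datetime, timedelta

# Hypothetical tweets as (account, timestamp) pairs; names and dates are made up.
tweets = [
    ("account_a", datetime(2020, 12, 1)),
    ("account_a", datetime(2020, 3, 15)),
    ("account_b", datetime(2020, 8, 20)),
    ("account_b", datetime(2019, 11, 2)),
]

def oldest_per_account(tweets):
    """Return the oldest timestamp seen for each account."""
    oldest = {}
    for account, ts in tweets:
        if account not in oldest or ts < oldest[account]:
            oldest[account] = ts
    return oldest

def keep_recent(tweets, reference_date, days=183):
    """Keep tweets posted within `days` days before `reference_date`.

    183 days is an assumed stand-in for "the last 6 months".
    """
    cutoff = reference_date - timedelta(days=days)
    return [(a, ts) for a, ts in tweets if cutoff <= ts <= reference_date]

oldest = oldest_per_account(tweets)
recent = keep_recent(tweets, reference_date=datetime(2020, 12, 15))
```

Comparing `oldest` across accounts shows how uneven the covered periods are; `recent` is the subset that shares a common window.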
From this sample, we then removed the 22 accounts belonging to congresswomen. We decided to do this after running a topic model on a random sample of 7,000 tweets (which was already a stretch for our 2012 laptops). Although we knew this had implications for what we would be able to say in our analysis later, these accounts had too much content on topics other than gender and feminism, and including them would have made it even harder to get a sense of what public discourse around these issues actually is.
Finally, we restricted the institutional accounts to the same period we had chosen for the individual ones, and ended up with 39 accounts and 25,941 tweets. Here, because the institutions we had chosen are explicitly dedicated to the topics we were interested in, there was no need to leave anyone out. Institutions are, well, more institutional…
With our data ready and the help of quanteda, we created a corpus. Finally, text was data. And so we did what any text miner would do: we got our rags and buckets out, put our aprons on, and began cleaning.
We removed stop words (both those that come with the tm package and some we compiled in our own list), punctuation, numbers, and symbols. Then we removed mentions: we were after the what is what, more than the who is who, of Colombian feminist Twitter. (And we would get to connections later on, with the network analysis.) Next were hashtags. Here, again, we understood that this would limit our analysis somewhat, but we felt we had a solid theory-based reason for it. So we had that going for us, which is nice. The reason is that hashtags tend to work globally, as a shortcut into the apparently borderless internet conversation, and we felt including them might disrupt the picture of the more local discourse we were trying to paint. (I don’t think this is well explained.)
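To make the cleaning steps concrete, here is a minimal sketch in plain Python. It is not the quanteda pipeline we actually ran: the stop word list is a tiny made-up stand-in for the tm list plus our own, and the example tweet and the `clean_tweet` helper are hypothetical.

```python
import re

# Tiny illustrative stop word list (an assumption, not the list used in the project).
STOP_WORDS = {"el", "la", "los", "las", "de", "en", "y", "que", "a", "por"}

def clean_tweet(text, stop_words=STOP_WORDS):
    """Lowercase a tweet and drop mentions, hashtags, numbers, punctuation,
    symbols and stop words, returning the remaining word tokens."""
    text = text.lower()
    # Remove @mentions and #hashtags as whole units.
    text = re.sub(r"[@#]\w+", " ", text)
    # Keep only runs of letters (accented letters included); this discards
    # digits, punctuation and symbols in one pass.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if t not in stop_words]

example = "¡Marchamos el 25 de noviembre! @usuario_1 #NiUnaMenos por los derechos de las mujeres"
cleaned = clean_tweet(example)
```

Removing mentions and hashtags before tokenizing matters: if punctuation were stripped first, `#NiUnaMenos` would survive as an ordinary word.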
And then, we did it all over again for our institutional accounts. By this time, this project was beginning to feel a little like what we figure raising twins must be like: you do a lot of cleaning. And you do it all twice. But we had gotten this far. And we were finally ready to see what all these tweets were about.
We got our dfm for the individual accounts, converted it into an stm corpus, and ran a topic model with 10 topics to capture the most prevalent themes in the tweets we had. And yes, we then did it again for the institutional accounts. We broadly identified what the 10 topics were for both the individual and the institutional accounts, although the model does not classify the documents perfectly according to our after-the-fact interpretation. But all in all, what we got seemed reasonable.
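For readers new to the term, a dfm (document-feature matrix) is simply a table with one row per document and one column per word, holding counts. The sketch below builds one by hand in plain Python; the documents and the `build_dfm` helper are hypothetical, and in the project this step was done with quanteda rather than written from scratch.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (lists of tokens).
docs = {
    "doc_1": ["derechos", "mujeres", "derechos"],
    "doc_2": ["violencia", "mujeres"],
}

def build_dfm(docs):
    """Return the sorted feature list and a {doc: [counts]} matrix."""
    features = sorted({tok for toks in docs.values() for tok in toks})
    counts = {name: Counter(toks) for name, toks in docs.items()}
    matrix = {name: [counts[name][f] for f in features] for name in docs}
    return features, matrix

features, dfm = build_dfm(docs)
```

A topic model takes this kind of matrix as input and looks for groups of words that tend to occur in the same documents.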
Our chosen institutions spend a lot of time tweeting about women in the public sphere (well, duh…). They also use Twitter to talk about their institutional events and work, which was also to be expected. And then it gets more interesting. Institutional accounts talk almost as much about social policy as they do about violence. And both the armed conflict and the truth commission figure prominently.
The model also picked up the conversation about reproductive rights that was sparked by an attempt made in mid-November to re-criminalize the three grounds on which abortion is currently legal in Colombia. In line with the decision of the Constitutional Court, which upheld its 2006 verdict decriminalizing abortion under certain circumstances, the accounts we chose talk about abortion in terms of rights and access.
Another cluster of discourse formed around the LGBT community and the pandemic. This probably reflects the escalation of police violence against the LGBT community during the enforcement of curfews put in place because of the pandemic. But in true institutional spirit, this topic includes more words about dialogue than about accountability.
Individual accounts center on women’s rights, which seems fairly obvious. It is interesting, however, that discourse here still seems focused on achieving equality with respect to men, which might be an initial indication of how far along the debate on gender is in Colombia.
As with the institutional accounts, violence features very prominently, but here it is mostly connected to the state.
Individual accounts also comment often on wider topics of national politics and public opinion, which was perhaps to be expected, but it raised our concerns about how well our model was able to classify the documents.
We got curious about how different occupations might affect the prevalence of these topics in each account. And since we had that information, we went ahead and made more plots:
A couple of interesting outcomes:
Then, with the help of the LDA model, we calculated the probability of each word being generated from each topic (betas) and the ‘per-document-per-topic probabilities’: the proportion of words from that document that are generated from each topic.
We plotted the whole thing, inspired by Julia Silge’s blog, which is awesome and which you can check out here: https://juliasilge.com/blog/evaluating-stm/
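To illustrate the two quantities mentioned above (per-topic word probabilities and per-document-per-topic probabilities), here is a toy calculation in plain Python. It is a counting exercise on made-up data, assuming every token already carries a topic label; a real topic model estimates these probabilities during fitting, so this shows what the numbers mean, not how the model computes them.

```python
from collections import Counter, defaultdict

# Hypothetical (document, word, topic) assignments, as if each token had
# already been attributed to a topic. All values are made up.
assignments = [
    ("doc_1", "derechos", 0),
    ("doc_1", "mujeres", 0),
    ("doc_1", "violencia", 1),
    ("doc_2", "violencia", 1),
    ("doc_2", "estado", 1),
    ("doc_2", "mujeres", 0),
    ("doc_2", "violencia", 1),
]

def normalise(counter):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counter.values())
    return {key: n / total for key, n in counter.items()}

words_by_topic = defaultdict(Counter)  # topic -> word counts
topics_by_doc = defaultdict(Counter)   # document -> topic counts
for doc, word, topic in assignments:
    words_by_topic[topic][word] += 1
    topics_by_doc[doc][topic] += 1

# beta[topic][word]: share of the topic's tokens that are this word.
beta = {t: normalise(c) for t, c in words_by_topic.items()}
# doc_topic[doc][topic]: share of the document's tokens assigned to this topic.
doc_topic = {d: normalise(c) for d, c in topics_by_doc.items()}
```

Sorting each `beta[topic]` by value gives the top words that characterise a topic, which is the kind of summary the plots are built on.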
After running the LDA model we wanted to see whether the topics change over time, so we plotted the proportion of documents from each topic by week from July to December of 2020. We analyzed the events that occurred during this period to get a better understanding of the trends in the data.
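The weekly proportions can be sketched as follows. This plain-Python example assumes each document has already been assigned to a single topic (for instance its most probable one, which is our assumption here); the dates, the topic labels and the `weekly_topic_shares` helper are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical documents as (date, assigned topic) pairs.
docs = [
    (date(2020, 11, 16), "reproductive rights"),
    (date(2020, 11, 18), "reproductive rights"),
    (date(2020, 11, 20), "violence"),
    (date(2020, 11, 25), "violence"),
]

def weekly_topic_shares(docs):
    """Return {(iso_year, iso_week): {topic: share of that week's documents}}."""
    by_week = defaultdict(Counter)
    for day, topic in docs:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        by_week[(iso[0], iso[1])][topic] += 1
    shares = {}
    for week, counts in sorted(by_week.items()):
        total = sum(counts.values())
        shares[week] = {topic: n / total for topic, n in counts.items()}
    return shares

shares = weekly_topic_shares(docs)
```

Using shares rather than raw counts keeps weeks with many tweets from dominating the picture.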
Now we move to positions! We want to know where each Twitter account lives in space and how the accounts relate to each other. To do so, we scaled some accounts along a single dimension based on their tweets discussing certain topics. The scale runs from -3 to 1 and represents the political position of each account; it extends to -3 because most of the accounts lean left.
Looking at the scale based on tweets discussing political topics, we can see that, of the 25 Twitter accounts in the sample, only three fall on the positive side of the scale, one sits at 0, and the rest fall on the negative side. These results imply that most of the selected accounts lean left when discussing political topics. It is interesting to look at the occupations behind the extreme values and the neutral one: the most extreme value to the left belongs to an activist, the most extreme to the right to an actress, and the neutral one to a lawyer.
What do the extremes of the scale seem to represent?